Sepsis is a systemic inflammatory response to infection. The terminology of sepsis is confusing as most modern criterion for sepsis include organ dysfunction, but a 1992 consensus defined “severe sepsis” as sepsis complicated by acute organ dysfunction; “severe sepsis” and sepsis are often used interchangably. [Angus, 2013]
In the hospital, patients used to often managed for sepsis based on “SIRS Criteria.” SIRS, referring to systemic inflammatory response syndrome was defined by physiologic abnormalities: temperature >98C, heart rate >90, respiratory rate that is >20 or arterial CO2 <32mmHg, and an elevated white count above 12,000/mm^3. With this criteria, sepsis was defined as SIRS + a source. This means that a culture proven infection in the context of SIRS defined a patient as having sepsis.
More recently, new criteria have been proposed. The sequential organ failure assessment (SOFA) was proposed and adopted as a prognostication tool for patients in the ICU with a focus on sepsis, but requires substantial invasive measurement. A short form of SOFA, quick SOFA (qSOFA), asks three questions similar to SIRS in order to identify patients who would benefit from a higher level of care. It is worth mentioning that SOFA used to mean “sepsis-related organ failure assessment” rather than “sequeqntial organ failure assessment” in the literature. The latest definition is known as SEPSIS-3, which was developed at the Third International Consensus for Sepsis and Septic Shock.
The murkiness of the definition around sepsis has challenged epidemiological measurement, but researchers such as Angus have attempted to measure the incidence of sepsis using ICD-9 codes that account for infection and organ dysfunction. [Angus, 2001]
In his 2001 paper the found an incidence of 3.0 cases per 1,000 people and 2.26 cases per 100 hospital discharges in 1995. They found a mortality of 28.6% although other studies quote in-hospital sepsis mortality rates at ~50%. [UpToDate]
Regardless, sepsis is a serious disease process, and the use of clinical and laboratory criteria for prognostication remains a challenging problem. The Acute Physiology, Age, and Chronic Health Evaluation (APACHE) system has been popular, with the latest version known as APACHE IV achieving an AUROC of 0.88 per the original paper. SIRS, qSOFA, and SOFA achieved AUROC’s of 0.589, 0.607, and 0.753 for in-hospital mortality in patients primarily hospitalized for an infectious disease process. [Raith, 2017]
Many of these older models are defined based on clinical expertise and consensus meetings, and predictive scores have often been derived using logistic regression. As the previously sited AUROCs suggest, they all leave room for improvement. The goal of this work is to apply modern machine learning techniques to physiologic parameters derived from cinical and laboratory data in an effort to build a better mortality predictor; specifically, rather than using SOFA’s sequential assessment, the goal was to make the prediction using only day from the first day in the ICU.
To accomplish this, the MIMIC-III database was used to develop a cohort of patients identified as having sepsis. Sepsis was defined using the ICD-9 codes developed by Angus to account for infectious processes and organ dysfunction. The outcome, as in the paper from Raith et al. above, was hospital expiration.
Clinical and laboatory data on the first day of the ICU admission were then extracted. After review of the data for errors, an exploratory data analysis was undertaken to understand if these data drawn only on the first day would be associated with the outcome of hospital expiration in order to guide the selection of a predictor set. From their an initial an initial gradient boosted tree model was built using the default parameters described in the XGBoost manual and an iteration count derived from 5-fold cross validation. This initial model was then tuned in a grid search to select the best parameters. The model was then evaluated and sensitivity analysis was perforomed, and the final model is presented for further validation on other data sources.
The MIMIC-III (Medical Information Mart for Intensive Care) database, a publicly available database maintained by the MIT Laboratory for Computational Physiology that contains health care data on 53,423 distinct hospital admissions at Beth Israel Deaconess Medical Center, was used at the primary data source. [Johnson et al., 2016]
Alistair Johnson, mentioned above in the development of OASIS, is now at the MIT Laboratory of Computational Physiology. In addition to his instrumental work in developing MIMIC-III, he has also developed open source code for extracting various clinical and laboratory paramters from the first day of a patients stay in the ICU from MIMIC-III. His colleague, Tom Pollard, also at the MIT Laboratory for Computational Physiology has developed open source code for selecting sepsis patients from MIMIC-III by various criteria including the Angus criteria mentioned above.
The code can be found here.
These SQL scripts were used to generate the following materialized views:
There is also the table of diagnoses by ICD-9 codes which was not used for model building but was examined briefly during exploratory data anlysis. The associated concept for ICD-9 codes (d_icd_diagnoses) was used to annotate these as needed.
Because the evaluators of this work will not have access to my local server environment, the above views were generated and then exported from PostgreSQL as CSV files. They will be loaded in now. Of note, some data are available in multiple places; e.g. glucose is present in both vitals and labs, lactate is present a single recording for the blood gas data whereas ranges for it are extracted in the laboratory pull. Variables will only be included once, and concepts like glucose and lactate which are best summarized in the labs table will be drawn from there instead of other souces. Also of note, unlike the other first day tables, which are summary recordings for all first day data, the ABG view contains all recorded ABGs for a patient as seperate observations. Because we want a measure predictor of disease severity it may seem obvious to take the so-called “worst” value. However, making this decision can introduce significant bias as pO2 and pCO2 are parameters which we would expect to be associated with illness status parabolically. That is to say, a very high pCO2 may indicate respiratory collapse, but a very low pCO2 will signify tachypnea which may be secondary to a variety of contexts of varying criticality. As such, a measure of central tendency will be applied. We would expect pCO2 to be fairly normally distributed as very low and very high values are associated with severe acid-base distrubances which are highly associated with death. pO2 however is less likely to obey this, and because of metabolic adaptations we should expect skew in all parameters. As such, the first recorded gas by chart time will be used. This is expected to introduce less bias as only non-differental information bias is potentially introduced.
Lastly, some patients have many admissions to the ICU, and the consensus at the MIT Laboratory for Computatonal Physiology is that building models off ICU stays after the first one can be misleading as a patient is already fairly “run down” after a single ICU course. As such, mortality predictions will be based off the first stay only.
detail <- read_csv("./data/icustay_detail.csv") %>%
select(icustay_id, hadm_id, subject_id, hadm_id, age, gender, ethnicity, los_icu,
intime, hospital_expire_flag) %>%
arrange(intime) %>%
distinct(hadm_id, .keep_all = TRUE)
angus <- read_csv("./data/angus_sepsis.csv") %>%
select(subject_id, hadm_id, angus)
vitals <- read_csv("./data/vitals.csv") %>%
select(-glucose_mean, -glucose_min, -glucose_max)
rrt <- read_csv("./data/rrt.csv")
uo <- read_csv("./data/uo.csv")
vent <- read_csv("./data/ventfirst.csv")
labs <- read_csv("./data/labs.csv")
abg_first <- read_csv("./data/abg.csv") %>%
select(icustay_id, po2, pco2, ph, charttime) %>%
arrange(charttime) %>%
distinct(icustay_id, .keep_all = TRUE) %>% select(-charttime)
gcs <- read_csv("./data/gcs.csv") %>% select(subject_id, hadm_id, icustay_id,
mingcs)
# Note: Elixhauster is a set of discrete categories for which a point is given;
# and the sum is the score.
elix <- read_csv("./data/elixhauser.csv") %>%
select(hadm_id, elixhauser = elixhauser_vanwalraven)
To build a cohort, we’ll begin by combining the angus and detail tables. This will done as an inner join: patients without an Angus status of 0 or 1 will not be included. From this base cohort, we’ll pull in all of the data we loaded above. Because we only kept the first ICU stay ID for each admission, a subject should only appear twice if they have been re-admitted. It is fine for a two admissions to appear, and we make the assumption that they represent distinct events leading to ICU admission although this may not hold in a small subset of the patients. We also make the assumption that previous admission data are uncorrelated with current admission data. This is not necessarily true, but mirror’s real life, and since this tool would be used per admission it seems reasonable to include both admissions as long as we use the first ICU stay’s data per admission.
We then apply the following exclusion criteria: 1) Age <16; our predictions are targeting adults only 2) LOS >4 hours (0.167 days); this excludes patients admitted to the ICU by mistake or for non-critical illness such as post-operatively. 3) We only keep those with angus equal to 1, as we only want to examine patients with sepsis by the Angus ICD-9 codes.
cohort <- inner_join(angus, detail, by = c("hadm_id", "subject_id")) %>%
left_join(vitals, by = c("icustay_id", "hadm_id", "subject_id")) %>%
left_join(gcs, by = c("icustay_id", "hadm_id", "subject_id")) %>%
left_join(labs, by = c("icustay_id", "hadm_id", "subject_id")) %>%
left_join(abg_first, by = c("icustay_id")) %>%
left_join(uo, by = c("icustay_id", "hadm_id", "subject_id")) %>%
left_join(rrt, by = c("icustay_id", "hadm_id", "subject_id")) %>%
left_join(vent, by = c("icustay_id", "hadm_id", "subject_id")) %>%
left_join(elix, by = c("hadm_id")) %>%
filter(angus == 1, los_icu > 0.167, age > 16 ) %>%
select(-angus, -icustay_id, -intime, -los_icu) %>%
select(subject_id, hadm_id, hospital_expire_flag, everything())
To confirm that our joins worked as expected, there should a cohort in which a single subject can appear twice, but there is only one ICU admission per admission to the hospital, and a patient should obviously only die once.
We can examine the first two quite simply; for expiration, we are performing a “sanity check.” Because the flag is 0 or 1, and because there can be multiple admissions per subject, the sum of the flag should be 0 or 1, but never more. We will use this idea to check if anyone has been coded as dying twice.
# First two checks
cohort %>% group_by(subject_id) %>% summarise(count = n()) %>% arrange(-count)
cohort %>% group_by(hadm_id) %>% summarise(count = n()) %>% arrange(-count)
# Expiration sanity check
cohort %>% group_by(subject_id) %>% summarise(expiration_sanity_check = sum(hospital_expire_flag)) %>% arrange(-expiration_sanity_check)
All of the checks suggest our dataset is okay. We no longer need subject_id’s as a subject can be present twice, but rather the admission ID (hadm_id), since this is a model for predicting mortality on admission.
Ultimately, we build a gradient boosted tree model using XGBoost. As such, we need to convert the predictors to numeric. Almost all are except for gender, which we can code as 1 or male and 0 for female. We also need to recode ethncity to make it less granular, and the dummy code the variables. Lastly, we are bound to have some ages in the cohort which are ~300. This is because HIPPA regulations require the actual age of patients of >89 years old to be hidden since, given the small size of this population, there is a risk of data leakage. The MIMIC-III guidelines say to shift any ages with these values to 91.4, the median for patients above 89 in the dataset.
cohort <- cohort %>% mutate(male_gender = as.numeric(cohort$gender == "M")) %>%
select(-gender)
ethnicities <- c("WHITE", "BLACK", "HISPANIC", "ASIAN")
for (i in 1:length(ethnicities)) {
cohort$ethnicity[str_detect(cohort$ethnicity, ethnicities[i])] <- ethnicities[i]
}
cohort$ethnicity[!(cohort$ethnicity %in% ethnicities)] <- "OTHER"
cohort <- cohort %>% mutate(white_eth = ifelse(ethnicity == "WHITE", 1, 0),
black_eth = ifelse(ethnicity == "BLACK", 1, 0),
hispanic_eth = ifelse(ethnicity == "HISPANIC", 1, 0),
asian_eth = ifelse(ethnicity == "ASIAN", 1, 0),
other_eth = ifelse(ethnicity == "OTHER", 1, 0)) %>%
select(-subject_id) %>%
select(ethnicity, hadm_id, hospital_expire_flag, male_gender, white_eth, black_eth, # we keep ethnicity here for EDA,
hispanic_eth, asian_eth, other_eth, vent, rrt, elixhauser, mingcs, age, # but will remove later since dummy coded
everything())
# Note that I used the select statement above to reorder the data. I've grouped
# variables together to the left of the continuous variables which are together.
# This is useful when you want to use subsetting and only want to look at certian
# variable types.
cohort$age[cohort$age > 89] <- 91.4
glimpse(cohort)
## Observations: 14,938
## Variables: 77
## $ ethnicity <chr> "WHITE", "WHITE", "WHITE", "WHITE", "WHIT...
## $ hadm_id <int> 128652, 165660, 185910, 101757, 174486, 1...
## $ hospital_expire_flag <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ male_gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,...
## $ white_eth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ black_eth <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hispanic_eth <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ asian_eth <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ other_eth <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ vent <int> 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ rrt <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ elixhauser <int> 9, 7, 7, 3, 27, 15, 39, 31, 10, 10, 1, 6,...
## $ mingcs <int> 15, 15, 15, 15, 15, 8, 15, 15, 10, 15, 14...
## $ age <dbl> 72.2671, 72.7284, 75.9415, 56.6349, 62.71...
## $ heartrate_min <dbl> 48, 67, 91, 73, 58, 63, 74, 66, 60, 81, 6...
## $ heartrate_max <int> 54, 93, 129, 114, 69, 74, 86, 85, 78, 100...
## $ heartrate_mean <dbl> 51.23077, 81.48000, 105.58537, 92.23810, ...
## $ sysbp_min <int> 85, 84, 67, 98, 66, 96, 83, 84, 107, 95, ...
## $ sysbp_max <int> 146, 139, 114, 153, 172, 144, 118, 121, 1...
## $ sysbp_mean <dbl> 114.15385, 108.04167, 94.30000, 127.13636...
## $ diasbp_min <int> 48, 48, 38, 49, 40, 34, 41, 38, 46, 54, 2...
## $ diasbp_max <int> 74, 84, 76, 79, 95, 80, 89, 83, 72, 79, 7...
## $ diasbp_mean <dbl> 59.92308, 62.16667, 62.35000, 64.36364, 6...
## $ meanbp_min <dbl> 63.0000, 58.0000, 53.3333, 81.0000, 48.00...
## $ meanbp_max <dbl> 99.0000, 98.0000, 88.0000, 100.0000, 189....
## $ meanbp_mean <dbl> 79.92308, 72.54167, 73.00001, 93.61112, 8...
## $ resprate_min <int> 12, 13, 14, 11, 13, 11, 16, 16, 9, 15, 13...
## $ resprate_max <int> 13, 23, 31, 27, 27, 21, 28, 21, 28, 23, 2...
## $ resprate_mean <dbl> 12.06667, 18.65517, 24.32500, 17.50000, 1...
## $ tempc_min <dbl> 36.83334, 36.11111, 34.61111, 36.66667, 3...
## $ tempc_max <dbl> 37.05555, 37.88889, 37.88889, 38.66667, 3...
## $ tempc_mean <dbl> 36.95833, 37.16667, 36.52222, 37.12346, 3...
## $ spo2_min <int> 97, 85, 68, 93, 96, 97, 93, 100, 96, 94, ...
## $ spo2_max <int> 100, 98, 100, 100, 100, 100, 100, 100, 10...
## $ spo2_mean <dbl> 98.71429, 94.20000, 96.83784, 98.18182, 9...
## $ aniongap_min <int> 9, 9, 16, 9, 13, 15, 11, 7, 11, 14, 16, 1...
## $ aniongap_max <int> 10, 11, 17, 12, 16, 17, 17, 10, 11, 17, 1...
## $ albumin_min <dbl> 2.7, 2.8, 2.7, NA, 3.1, 3.4, 2.4, NA, 3.4...
## $ albumin_max <dbl> 2.7, 2.8, 2.7, NA, 3.1, 3.4, 2.9, NA, 3.4...
## $ bands_min <int> NA, NA, NA, 30, NA, NA, NA, NA, NA, NA, 6...
## $ bands_max <int> NA, NA, NA, 30, NA, NA, NA, NA, NA, NA, 6...
## $ bicarbonate_min <int> 24, 24, 20, 26, 16, 20, 14, 30, 23, 15, 1...
## $ bicarbonate_max <int> 27, 28, 21, 26, 19, 25, 20, 31, 23, 19, 2...
## $ bilirubin_min <dbl> 0.9, NA, 1.3, NA, 0.6, 0.5, 0.1, 0.2, 0.3...
## $ bilirubin_max <dbl> 0.9, NA, 1.3, NA, 0.6, 0.5, 0.2, 0.2, 0.3...
## $ creatinine_min <dbl> 0.7, 0.9, 1.8, 0.4, 0.9, 1.0, 0.9, 0.9, 0...
## $ creatinine_max <dbl> 0.9, 1.0, 2.6, 0.5, 1.0, 1.0, 1.9, 1.1, 0...
## $ chloride_min <int> 95, 106, 106, 104, 113, 104, 126, 101, 11...
## $ chloride_max <int> 103, 109, 107, 107, 114, 105, 140, 111, 1...
## $ glucose_min <int> 101, 101, 94, 57, 72, 104, 47, 113, 131, ...
## $ glucose_max <int> 116, 126, 137, 77, 187, 144, 369, 235, 26...
## $ hematocrit_min <dbl> 30.0, 32.7, 36.0, 28.5, 25.0, 30.0, 25.0,...
## $ hematocrit_max <dbl> 30.0, 36.5, 36.0, 33.0, 39.2, 33.2, 31.5,...
## $ hemoglobin_min <dbl> 10.6, 11.5, 12.5, 9.9, 8.4, 10.5, 7.8, 8....
## $ hemoglobin_max <dbl> 10.6, 12.7, 12.5, 11.0, 13.6, 11.2, 10.1,...
## $ lactate_min <dbl> 1.4, 0.7, 2.8, 1.1, 2.1, NA, 1.5, 2.6, 1....
## $ lactate_max <dbl> 1.5, 1.0, 2.8, 1.1, 2.1, NA, 1.5, 4.5, 2....
## $ platelet_min <int> 109, 146, 76, 177, 66, 91, 67, 69, 203, 1...
## $ platelet_max <int> 184, 174, 76, 189, 182, 97, 109, 76, 227,...
## $ potassium_min <dbl> 3.3, 3.9, 4.2, 3.7, 4.2, 3.9, 2.6, 3.3, 2...
## $ potassium_max <dbl> 3.7, 4.2, 4.4, 4.1, 5.7, 4.0, 3.7, 4.1, 4...
## $ ptt_min <dbl> 28.5, 28.0, 28.2, 23.7, 22.9, 29.3, 28.4,...
## $ ptt_max <dbl> 29.4, 61.7, 50.6, 29.5, 48.1, 30.1, 29.8,...
## $ inr_min <dbl> 1.1, 1.2, 1.4, 0.9, 1.1, 1.2, 1.3, 1.4, 1...
## $ inr_max <dbl> 1.2, 1.4, 1.7, 1.3, 1.4, 1.2, 1.4, 1.4, 1...
## $ pt_min <dbl> 13.0, 14.2, 14.4, 11.5, 13.0, 13.8, 15.2,...
## $ pt_max <dbl> 13.2, 16.2, 15.6, 13.7, 15.3, 13.9, 15.9,...
## $ sodium_min <int> 128, 139, 139, 138, 139, 138, 150, 145, 1...
## $ sodium_max <int> 133, 142, 140, 138, 142, 140, 166, 146, 1...
## $ bun_min <int> 11, 19, 42, 18, 32, 11, 43, 23, 20, 47, 5...
## $ bun_max <int> 13, 22, 55, 18, 37, 17, 77, 26, 23, 59, 5...
## $ wbc_min <dbl> 6.9, 10.9, 18.2, 4.5, 4.8, 4.2, 2.6, 2.3,...
## $ wbc_max <dbl> 6.9, 15.3, 18.2, 15.9, 13.1, 4.5, 5.0, 2....
## $ po2 <int> 148, 59, NA, 142, 122, NA, NA, 267, 53, N...
## $ pco2 <int> 36, 50, NA, 42, 40, NA, NA, 87, 57, NA, N...
## $ ph <dbl> 7.47, 7.33, NA, 7.40, 7.32, NA, NA, 7.18,...
## $ urineoutput <dbl> 3345, 2030, 560, 2305, 1990, 2330, 1225, ...
We don’t want to exclude missing data en masse: lots of specifc tests, e.g. bands which traditionally need to requested on a complete blood count (CBC), will be missing. In additin, gradient boosting can technically handle missing data in tree construction
However, we do want to exclude values which are clinically impossible. This is not to say incompatible with life (e.g. BP = 0), which would certainly expect in the critically ill at times. The key point here being that the tails of distributions for many of these parameters are real and clinically relevant and simple exclusion of outliers by say Tukey’s rule will only bias our results. So, we will carefully apply clinical knowledge to these data with a very conversative approach to discarding data (only excluding things we are certain could not be real). We will also examine how the missing data are distributed by variable to discern whether such variables should be excluded from model buidling procedures.
summary(cohort)
## ethnicity hadm_id hospital_expire_flag male_gender
## Length:14938 Min. :100011 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:125209 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :149858 Median :0.0000 Median :1.0000
## Mean :149839 Mean :0.2133 Mean :0.5311
## 3rd Qu.:174835 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :199999 Max. :1.0000 Max. :1.0000
##
## white_eth black_eth hispanic_eth asian_eth
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :1.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.7257 Mean :0.1014 Mean :0.02999 Mean :0.02497
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## other_eth vent rrt elixhauser
## Min. :0.000 Min. :0.000 Min. :0.00000 Min. :-14.00
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.: 5.00
## Median :0.000 Median :1.000 Median :0.00000 Median : 11.00
## Mean :0.118 Mean :0.525 Mean :0.06172 Mean : 11.57
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.: 17.00
## Max. :1.000 Max. :1.000 Max. :1.00000 Max. : 54.00
##
## mingcs age heartrate_min heartrate_max
## Min. : 3.00 Min. :17.00 Min. : 0.35 Min. : 41.0
## 1st Qu.:14.00 1st Qu.:55.83 1st Qu.: 62.00 1st Qu.: 92.0
## Median :15.00 Median :68.78 Median : 72.00 Median :106.0
## Mean :13.51 Mean :66.90 Mean : 73.53 Mean :107.7
## 3rd Qu.:15.00 3rd Qu.:79.95 3rd Qu.: 84.00 3rd Qu.:122.0
## Max. :15.00 Max. :91.40 Max. :137.00 Max. :280.0
## NA's :171 NA's :148 NA's :149
## heartrate_mean sysbp_min sysbp_max sysbp_mean
## Min. : 34.84 Min. : 1.00 Min. : 46.0 Min. : 46.0
## 1st Qu.: 76.31 1st Qu.: 77.00 1st Qu.:130.0 1st Qu.:103.6
## Median : 87.46 Median : 87.00 Median :145.0 Median :112.7
## Mean : 88.46 Mean : 87.88 Mean :148.4 Mean :115.8
## 3rd Qu.: 99.68 3rd Qu.: 98.00 3rd Qu.:163.0 3rd Qu.:125.9
## Max. :154.97 Max. :172.00 Max. :300.0 Max. :208.0
## NA's :148 NA's :159 NA's :157 NA's :157
## diasbp_min diasbp_max diasbp_mean meanbp_min
## Min. : 1.00 Min. : 31.0 Min. : 20.72 Min. : 0.20
## 1st Qu.: 34.00 1st Qu.: 71.0 1st Qu.: 51.69 1st Qu.: 48.00
## Median : 41.00 Median : 81.0 Median : 58.09 Median : 55.67
## Mean : 41.36 Mean : 83.8 Mean : 59.00 Mean : 54.99
## 3rd Qu.: 49.00 3rd Qu.: 93.0 3rd Qu.: 65.23 3rd Qu.: 63.00
## Max. :104.00 Max. :287.0 Max. :132.33 Max. :119.00
## NA's :159 NA's :158 NA's :158 NA's :149
## meanbp_max meanbp_mean resprate_min resprate_max
## Min. : 29.0 Min. : 29.00 Min. : 1.00 Min. : 8.00
## 1st Qu.: 88.0 1st Qu.: 67.93 1st Qu.:10.00 1st Qu.:24.00
## Median :100.0 Median : 74.21 Median :13.00 Median :28.00
## Mean :105.2 Mean : 75.76 Mean :13.07 Mean :28.58
## 3rd Qu.:113.0 3rd Qu.: 82.22 3rd Qu.:16.00 3rd Qu.:32.00
## Max. :299.0 Max. :153.84 Max. :36.00 Max. :69.00
## NA's :149 NA's :149 NA's :164 NA's :159
## resprate_mean tempc_min tempc_max tempc_mean
## Min. : 7.00 Min. :15.00 Min. :32.20 Min. :31.96
## 1st Qu.:16.80 1st Qu.:35.61 1st Qu.:37.00 1st Qu.:36.39
## Median :19.36 Median :36.11 Median :37.50 Median :36.82
## Mean :19.94 Mean :36.08 Mean :37.59 Mean :36.86
## 3rd Qu.:22.54 3rd Qu.:36.61 3rd Qu.:38.17 3rd Qu.:37.31
## Max. :43.78 Max. :39.72 Max. :42.78 Max. :40.10
## NA's :159 NA's :334 NA's :334 NA's :334
## spo2_min spo2_max spo2_mean aniongap_min
## Min. : 2.00 Min. : 27.00 Min. : 11.17 Min. : 1.0
## 1st Qu.: 89.00 1st Qu.:100.00 1st Qu.: 95.92 1st Qu.:11.0
## Median : 93.00 Median :100.00 Median : 97.42 Median :13.0
## Mean : 90.42 Mean : 99.56 Mean : 96.95 Mean :13.4
## 3rd Qu.: 95.00 3rd Qu.:100.00 3rd Qu.: 98.70 3rd Qu.:15.0
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :45.0
## NA's :169 NA's :170 NA's :169 NA's :118
## aniongap_max albumin_min albumin_max bands_min
## Min. : 4.00 Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:14.00 1st Qu.:2.400 1st Qu.:2.500 1st Qu.: 2.000
## Median :16.00 Median :2.900 Median :3.000 Median : 6.000
## Mean :16.89 Mean :2.904 Mean :3.003 Mean : 9.204
## 3rd Qu.:19.00 3rd Qu.:3.400 3rd Qu.:3.500 3rd Qu.:13.000
## Max. :56.00 Max. :5.700 Max. :5.700 Max. :69.000
## NA's :118 NA's :7294 NA's :7294 NA's :11425
## bands_max bicarbonate_min bicarbonate_max bilirubin_min
## Min. : 1.00 Min. : 5.0 Min. : 5.00 Min. : 0.10
## 1st Qu.: 3.00 1st Qu.:18.0 1st Qu.:21.00 1st Qu.: 0.40
## Median : 8.00 Median :22.0 Median :24.00 Median : 0.60
## Mean :11.63 Mean :21.6 Mean :24.65 Mean : 1.92
## 3rd Qu.:16.00 3rd Qu.:25.0 3rd Qu.:27.00 3rd Qu.: 1.30
## Max. :79.00 Max. :50.0 Max. :53.00 Max. :79.00
## NA's :11424 NA's :74 NA's :73 NA's :5649
## bilirubin_max creatinine_min creatinine_max chloride_min
## Min. : 0.100 Min. : 0.100 Min. : 0.100 Min. : 61
## 1st Qu.: 0.400 1st Qu.: 0.700 1st Qu.: 0.900 1st Qu.: 98
## Median : 0.700 Median : 1.100 Median : 1.300 Median :102
## Mean : 2.229 Mean : 1.601 Mean : 1.949 Mean :102
## 3rd Qu.: 1.600 3rd Qu.: 1.800 3rd Qu.: 2.300 3rd Qu.:106
## Max. :82.800 Max. :28.000 Max. :29.600 Max. :145
## NA's :5649 NA's :56 NA's :56 NA's :53
## chloride_max glucose_min glucose_max hematocrit_min
## Min. : 67.0 Min. : 7.0 Min. : 37.0 Min. : 7.30
## 1st Qu.:103.0 1st Qu.: 88.0 1st Qu.: 125.0 1st Qu.:25.20
## Median :107.0 Median :106.0 Median : 158.0 Median :28.80
## Mean :107.1 Mean :112.7 Mean : 185.4 Mean :29.19
## 3rd Qu.:111.0 3rd Qu.:129.0 3rd Qu.: 210.0 3rd Qu.:33.00
## Max. :198.0 Max. :576.0 Max. :1746.0 Max. :61.00
## NA's :50 NA's :45 NA's :45 NA's :62
## hematocrit_max hemoglobin_min hemoglobin_max lactate_min
## Min. :16.40 Min. : 2.300 Min. : 4.80 Min. : 0.100
## 1st Qu.:30.20 1st Qu.: 8.400 1st Qu.: 9.90 1st Qu.: 1.075
## Median :33.90 Median : 9.600 Median :11.20 Median : 1.500
## Mean :34.54 Mean : 9.782 Mean :11.43 Mean : 1.831
## 3rd Qu.:38.10 3rd Qu.:11.100 3rd Qu.:12.70 3rd Qu.: 2.100
## Max. :64.10 Max. :19.800 Max. :21.70 Max. :24.200
## NA's :62 NA's :78 NA's :78 NA's :3567
## lactate_max platelet_min platelet_max potassium_min
## Min. : 0.300 Min. : 5.0 Min. : 9.0 Min. :0.600
## 1st Qu.: 1.500 1st Qu.: 120.0 1st Qu.: 156.0 1st Qu.:3.400
## Median : 2.300 Median : 187.0 Median : 228.0 Median :3.800
## Mean : 3.195 Mean : 205.5 Mean : 250.7 Mean :3.791
## 3rd Qu.: 3.800 3rd Qu.: 266.0 3rd Qu.: 316.0 3rd Qu.:4.200
## Max. :32.000 Max. :1592.0 Max. :1608.0 Max. :9.200
## NA's :3567 NA's :89 NA's :89 NA's :42
## potassium_max ptt_min ptt_max inr_min
## Min. : 1.900 Min. : 14.40 Min. : 14.40 Min. : 0.500
## 1st Qu.: 4.100 1st Qu.: 26.10 1st Qu.: 28.40 1st Qu.: 1.100
## Median : 4.500 Median : 29.90 Median : 34.30 Median : 1.300
## Mean : 4.714 Mean : 32.85 Mean : 46.14 Mean : 1.487
## 3rd Qu.: 5.100 3rd Qu.: 35.50 3rd Qu.: 47.60 3rd Qu.: 1.500
## Max. :19.600 Max. :150.00 Max. :150.00 Max. :22.800
## NA's :42 NA's :1303 NA's :1303 NA's :1251
## inr_max pt_min pt_max sodium_min
## Min. : 0.800 Min. : 8.00 Min. : 9.10 Min. : 95.0
## 1st Qu.: 1.200 1st Qu.: 13.00 1st Qu.: 13.50 1st Qu.:134.0
## Median : 1.400 Median : 14.10 Median : 15.10 Median :137.0
## Mean : 1.873 Mean : 15.87 Mean : 18.52 Mean :136.5
## 3rd Qu.: 1.800 3rd Qu.: 16.20 3rd Qu.: 18.60 3rd Qu.:140.0
## Max. :48.200 Max. :150.00 Max. :150.00 Max. :174.0
## NA's :1251 NA's :1248 NA's :1248 NA's :45
## sodium_max bun_min bun_max wbc_min
## Min. :108.0 Min. : 1.00 Min. : 2.00 Min. : 0.10
## 1st Qu.:137.0 1st Qu.: 15.00 1st Qu.: 18.00 1st Qu.: 6.80
## Median :140.0 Median : 24.00 Median : 29.00 Median : 9.90
## Mean :140.3 Mean : 31.26 Mean : 37.03 Mean : 11.34
## 3rd Qu.:143.0 3rd Qu.: 40.00 3rd Qu.: 48.00 3rd Qu.: 13.90
## Max. :182.0 Max. :254.00 Max. :272.00 Max. :575.80
## NA's :44 NA's :56 NA's :56 NA's :92
## wbc_max po2 pco2 ph
## Min. : 0.10 Min. : 16 Min. : 8.00 Min. :6.400
## 1st Qu.: 9.20 1st Qu.: 77 1st Qu.: 34.00 1st Qu.:7.280
## Median : 13.10 Median :114 Median : 41.00 Median :7.370
## Mean : 15.25 Mean :165 Mean : 44.15 Mean :7.347
## 3rd Qu.: 18.50 3rd Qu.:219 3rd Qu.: 49.00 3rd Qu.:7.430
## Max. :846.70 Max. :797 Max. :193.00 Max. :7.890
## NA's :92 NA's :4293 NA's :4294 NA's :4295
## urineoutput
## Min. :-2600
## 1st Qu.: 810
## Median : 1420
## Mean : 1692
## 3rd Qu.: 2282
## Max. :45680
## NA's :739
From this we can make the following observations:
cohort %>% select(urineoutput) %>% arrange(urineoutput)
There are 5 below 0; these numbers can either be zeroed or the patients can be removed. As there are ~15k admissions, it is less biased to toss the patients then to plug in a value.
cohort <- cohort %>% filter(urineoutput > 0)
There are some very high lactate values, but even personally I’ve see extreme lactate values in the ICU during shock recussitations and events such as bowel infarction.
The bloog gas values, too, are clinically reasonable. pO2 can get unresonably high in real life when FiO2 is set very high, and pCO2 appears constrained to reasonable values. pH is contrained from 6.4 (very acidic) to 7.8 (very basic), but both can occur during shock states or poisonings as respective examples.
Overall, there are bound to be data entry errors in this data, but I am more comfortable with non-differential information bias that hurts model fitting than picking and choosing what stays and introducing a selection bias.
With this basic review of the dataset complete, with most data kept given the clinically possibility of occuring, we are now ready to explore the data.
We will begin EDA by asking for simple questions of our data that get to the underlying ideas expressed in our Initial Questions section.
Sex and Age
mean(cohort$age)
## [1] 66.91273
mean(cohort$male_gender)
## [1] 0.5307556
cohort %>% ggplot() + geom_bar(position = "fill", aes(x = 0, fill = as.factor(male_gender))) +
coord_flip() + xlab("") + ylab("Proportion") +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
aspect.ratio = .15 ) +
scale_fill_discrete(labels = c("Male", "Female"), name = "Gender") +
ggtitle("Gender Proportion")
Ethnicity?
cohort %>% group_by(ethnicity) %>% summarise(pcount = n()) %>% ungroup() %>%
mutate(Proportion = round(pcount / sum(pcount), 2)) %>% select(-pcount, Ethnicity = ethnicity) %>% knitr::kable()
| Ethnicity | Proportion |
|---|---|
| ASIAN | 0.03 |
| BLACK | 0.09 |
| HISPANIC | 0.03 |
| OTHER | 0.12 |
| WHITE | 0.73 |
cohort %>% ggplot() + geom_bar(aes(x = as.factor(ethnicity))) +
xlab("Ethnicity") + ggtitle("Ethnicity vs. Death") +
scale_fill_brewer(palette = "Set1", labels = c("Survived", "Expired"), name = "Status")
mean(cohort$hospital_expire_flag)
## [1] 0.2065981
cohort %>% group_by(male_gender) %>% summarise(prop_died = mean(hospital_expire_flag)) %>%
ggplot() + geom_col(aes(x = as.factor(male_gender), y = prop_died))
# Better graph for making website
cohort %>% group_by(male_gender) %>% ggplot() +
geom_bar(position = "fill", aes(x = as.factor(male_gender), fill = as.factor(hospital_expire_flag))) +
coord_flip() + xlab("Gender") + ylab("Proportion") + scale_x_discrete(labels = c("Female", "Male")) +
theme(aspect.ratio = .15 ) +
ggtitle("Gender vs. Death") + scale_fill_brewer(palette = "Set1", labels = c("Survived", "Expired"), name = "Status")
There is only a slight gender difference, contrary to many other disease processes which often have gender differentials.
cohort %>% ggplot() + geom_bar(aes(x = as.factor(ethnicity), fill = as.factor(hospital_expire_flag))) +
xlab("Ethnicity") + scale_fill_discrete(labels = c("Survived", "Expired"), name = "Status") +
ggtitle("Ethnicity vs. Death") + scale_fill_brewer(palette = "Set1", labels = c("Survived", "Expired"), name = "Status")
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
Compared to all other ethnicities, those identifying as Hispanic had a lower proprtion of death.
Asians had a slightly lower proportion than non-Asians.
The binned group “OTHER”, which contains those with “unknown” and number of other ethncities that were input into MIMIC-III has a larger proprtion of death than those identifying as other races.
We can look at the crude odds ratio for ethnicity.
cohort$ethnicity <- as.factor(cohort$ethnicity)
cohort$ethnicity <- relevel(cohort$ethnicity, ref = "WHITE")
MIMICbook::plot_OR_by_level(cohort, "ethnicity", "hospital_expire_flag") + ggtitle("OR Death, by Ethncity") +
xlab("Ethnicity")
And also , and also stratify on gender.
MIMICbook::plot_OR_by_level(cohort %>% mutate(gender = ifelse(male_gender == 1, "Male", "Female")), "ethnicity", "hospital_expire_flag", factor.var2 = "gender") + ggtitle("OR Death, by Gender and Ethncity") +
xlab("Gender") + scale_color_brewer(palette = "Set1", name = "Ethnicity")
Interestingly, the CI suggest we lose significance when we stratify ethnicity on gender.
There are certainly differences by ethnicity.
For age we’ll examine the proprtion of death by quntile of age.
groups <- 5
age_quantile <- with(cohort, cut(age,
breaks = quantile(age,
probs = seq(0,1, by = (1/groups)),
na.rm=TRUE),
include.lowest = TRUE))
cohort %>% mutate(age_group = age_quantile) %>% group_by(age_group) %>%
summarise(prop_died = mean(hospital_expire_flag)) %>%
ggplot() + geom_point(aes(x = age_group, y = prop_died)) +
xlab("Age Quintile") + ylab("Proportion Expired") +
geom_text(aes(x = age_group, y = prop_died+.005, label = round(prop_died, 2))) +
ggtitle("Age Group vs. Proportion Dead") +
geom_smooth(aes(x = age_group, y = prop_died, group = 1), method = c("lm"))
There appears to be a linear relationship between age group and proportion dying, as we would expect. We can repeat this graph and split by gender to see if we would expect an interaction.
cohort %>% mutate(age_group = age_quantile) %>% group_by(age_group, male_gender) %>%
summarise(prop_died = mean(hospital_expire_flag)) %>%
ggplot() + geom_point(aes(x = age_group, y = prop_died, color = as.factor(male_gender))) +
xlab("Age Quintile") + ylab("Proportion Expired") +
geom_text(aes(x = age_group, y = prop_died+.005, label = round(prop_died, 2))) +
ggtitle("Age Group vs. Proportion Dead, Examining Gender Interaction") +
scale_color_discrete(labels = c("Male", "Female"), name = "Gender") +
geom_smooth(aes(x = age_group, y = prop_died, group = male_gender), method = c("lm"))
The linear relationship holds, and the lines look fairly parallel suggesting interaction is likely not present and doesn’t need to be explored.
We can also look at this by ethnicity, and ethnicity stratified by gender.
Overall, the trend holds, with slightly varying slopes between groups. I suspect this is secondary to the sample size and is not suggestive of effect modification.
Lastly, lets look at Elixhauser.
groups <- 5
elix_quantile <- with(cohort, cut(elixhauser,
breaks = quantile(elixhauser,
probs = seq(0,1, by = (1/groups)),
na.rm=TRUE),
include.lowest = TRUE))
cohort %>% mutate(elix_group = elix_quantile) %>% group_by(elix_group) %>%
summarise(prop_died = mean(hospital_expire_flag)) %>%
ggplot() + geom_point(aes(x = elix_group, y = prop_died)) +
xlab("Elixhauser Quintile") + ylab("Proportion Expired") +
geom_text(aes(x = elix_group, y = prop_died+.005, label = round(prop_died, 2))) +
ggtitle("Elixhauser Group vs. Proportion Dead") +
geom_smooth(aes(x = elix_group, y = prop_died, group = 1), method = c("lm"))
As we would expect, mortality correlates strongly with Elixhauser.
Before proceeding, we can remove the ethnicity variable we saved at the front of the dataset since we will use the dummy coded version going forward.
cohort <- cohort %>% select(-ethnicity)
Before we get specifically to the association between the variables and outcome, lets see what the distribution of the continuous variables looks like.
melt(cohort %>% select(-hospital_expire_flag, -male_gender, -white_eth,
-black_eth, -hispanic_eth, -asian_eth, -other_eth, -vent, -rrt),
id.vars = "hadm_id") %>%
ggplot() + geom_histogram(aes(x = value)) + facet_wrap(~variable, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 73657 rows containing non-finite values (stat_bin).
This figure has a lot going on, but allows us to quickly scan for the appearance of normality and skew. As we would expect, many parameters, such as heart rate, which are clinically normal on either side of the mean, appear normally distributed. Parameters for which only one side of the mean is considered normal, such as BUN, appear quite skewed.
Regardless, we can use tests of association that assume normality. In most cases it will be appropriate because of the shape of the distribution, but where it is not, we can justify such tests by the number of patients in our cohort in reliance of the central limit theorem (CLT).
We can very eaily examine the association between the exposure and outcome using the package Table 1, and stratifying by the outcome.
CreateTableOne(vars = colnames(cohort %>% select(-hadm_id, -hospital_expire_flag)),
data = cohort, strata = "hospital_expire_flag")
## Stratified by hospital_expire_flag
## 0 1 p
## n 11183 2912
## male_gender (mean (sd)) 0.53 (0.50) 0.54 (0.50) 0.119
## white_eth (mean (sd)) 0.74 (0.44) 0.72 (0.45) 0.192
## black_eth (mean (sd)) 0.10 (0.30) 0.07 (0.25) <0.001
## hispanic_eth (mean (sd)) 0.03 (0.18) 0.02 (0.15) 0.003
## asian_eth (mean (sd)) 0.03 (0.16) 0.02 (0.15) 0.512
## other_eth (mean (sd)) 0.11 (0.31) 0.16 (0.37) <0.001
## vent (mean (sd)) 0.51 (0.50) 0.62 (0.49) <0.001
## rrt (mean (sd)) 0.04 (0.20) 0.06 (0.24) <0.001
## elixhauser (mean (sd)) 10.65 (8.15) 14.31 (8.57) <0.001
## mingcs (mean (sd)) 13.62 (2.66) 13.04 (3.38) <0.001
## age (mean (sd)) 65.98 (16.68) 70.48 (15.34) <0.001
## heartrate_min (mean (sd)) 73.21 (15.46) 75.04 (18.52) <0.001
## heartrate_max (mean (sd)) 106.48 (21.30) 112.99 (24.52) <0.001
## heartrate_mean (mean (sd)) 87.58 (16.23) 92.23 (18.13) <0.001
## sysbp_min (mean (sd)) 89.80 (17.28) 81.80 (19.71) <0.001
## sysbp_max (mean (sd)) 149.61 (24.89) 145.50 (25.99) <0.001
## sysbp_mean (mean (sd)) 117.41 (16.76) 111.26 (17.24) <0.001
## diasbp_min (mean (sd)) 42.44 (11.52) 38.38 (12.26) <0.001
## diasbp_max (mean (sd)) 84.51 (18.90) 82.41 (20.77) <0.001
## diasbp_mean (mean (sd)) 59.86 (10.61) 56.88 (10.70) <0.001
## meanbp_min (mean (sd)) 56.27 (13.62) 51.10 (15.28) <0.001
## meanbp_max (mean (sd)) 105.58 (28.41) 105.40 (33.63) 0.766
## meanbp_mean (mean (sd)) 76.71 (11.13) 73.40 (11.46) <0.001
## resprate_min (mean (sd)) 12.95 (3.86) 13.63 (4.41) <0.001
## resprate_max (mean (sd)) 28.18 (6.71) 30.19 (7.42) <0.001
## resprate_mean (mean (sd)) 19.64 (4.23) 21.21 (4.78) <0.001
## tempc_min (mean (sd)) 36.14 (0.84) 35.89 (1.04) <0.001
## tempc_max (mean (sd)) 37.63 (0.86) 37.48 (1.02) <0.001
## tempc_mean (mean (sd)) 36.90 (0.68) 36.71 (0.86) <0.001
## spo2_min (mean (sd)) 91.34 (7.81) 87.44 (13.56) <0.001
## spo2_max (mean (sd)) 99.62 (0.93) 99.35 (2.41) <0.001
## spo2_mean (mean (sd)) 97.22 (2.07) 96.01 (4.92) <0.001
## aniongap_min (mean (sd)) 12.92 (3.20) 14.67 (4.43) <0.001
## aniongap_max (mean (sd)) 16.32 (4.57) 18.50 (5.90) <0.001
## albumin_min (mean (sd)) 2.98 (0.66) 2.68 (0.70) <0.001
## albumin_max (mean (sd)) 3.07 (0.65) 2.81 (0.70) <0.001
## bands_min (mean (sd)) 8.94 (9.10) 10.17 (10.80) 0.001
## bands_max (mean (sd)) 11.31 (11.09) 12.94 (12.99) <0.001
## bicarbonate_min (mean (sd)) 21.96 (5.37) 20.04 (6.20) <0.001
## bicarbonate_max (mean (sd)) 24.91 (5.02) 23.44 (5.69) <0.001
## bilirubin_min (mean (sd)) 1.54 (3.84) 3.22 (6.45) <0.001
## bilirubin_max (mean (sd)) 1.80 (4.27) 3.75 (7.16) <0.001
## creatinine_min (mean (sd)) 1.45 (1.41) 1.67 (1.29) <0.001
## creatinine_max (mean (sd)) 1.79 (1.72) 2.00 (1.48) <0.001
## chloride_min (mean (sd)) 102.15 (6.81) 101.94 (7.39) 0.146
## chloride_max (mean (sd)) 107.26 (6.94) 107.22 (7.54) 0.783
## glucose_min (mean (sd)) 113.15 (39.11) 113.61 (46.36) 0.586
## glucose_max (mean (sd)) 184.50 (109.07) 190.38 (103.31) 0.009
## hematocrit_min (mean (sd)) 29.29 (5.95) 28.81 (6.18) <0.001
## hematocrit_max (mean (sd)) 34.67 (5.87) 34.25 (6.11) 0.001
## hemoglobin_min (mean (sd)) 9.85 (2.03) 9.60 (2.06) <0.001
## hemoglobin_max (mean (sd)) 11.50 (2.04) 11.28 (2.08) <0.001
## lactate_min (mean (sd)) 1.61 (0.96) 2.50 (2.16) <0.001
## lactate_max (mean (sd)) 2.82 (2.17) 4.37 (3.90) <0.001
## platelet_min (mean (sd)) 209.74 (120.67) 186.29 (130.37) <0.001
## platelet_max (mean (sd)) 254.86 (137.65) 232.92 (148.37) <0.001
## potassium_min (mean (sd)) 3.76 (0.58) 3.86 (0.67) <0.001
## potassium_max (mean (sd)) 4.67 (0.97) 4.82 (1.02) <0.001
## ptt_min (mean (sd)) 31.76 (11.25) 35.97 (15.35) <0.001
## ptt_max (mean (sd)) 43.97 (29.46) 52.32 (34.98) <0.001
## inr_min (mean (sd)) 1.43 (0.70) 1.66 (1.09) <0.001
## inr_max (mean (sd)) 1.76 (1.57) 2.20 (2.12) <0.001
## pt_min (mean (sd)) 15.45 (5.67) 17.09 (7.99) <0.001
## pt_max (mean (sd)) 17.81 (10.59) 20.67 (14.01) <0.001
## sodium_min (mean (sd)) 136.53 (5.44) 136.28 (6.31) 0.033
## sodium_max (mean (sd)) 140.33 (5.37) 140.27 (6.10) 0.617
## bun_min (mean (sd)) 28.86 (22.64) 38.81 (26.94) <0.001
## bun_max (mean (sd)) 34.51 (26.09) 44.64 (29.93) <0.001
## wbc_min (mean (sd)) 10.90 (7.21) 12.78 (16.25) <0.001
## wbc_max (mean (sd)) 14.72 (10.60) 17.10 (22.97) <0.001
## po2 (mean (sd)) 170.14 (123.16) 152.28 (114.45) <0.001
## pco2 (mean (sd)) 44.50 (16.44) 42.79 (17.04) <0.001
## ph (mean (sd)) 7.35 (0.11) 7.33 (0.13) <0.001
## urineoutput (mean (sd)) 1838.06 (1346.52) 1195.46 (1081.46) <0.001
## Stratified by hospital_expire_flag
## test
## n
## male_gender (mean (sd))
## white_eth (mean (sd))
## black_eth (mean (sd))
## hispanic_eth (mean (sd))
## asian_eth (mean (sd))
## other_eth (mean (sd))
## vent (mean (sd))
## rrt (mean (sd))
## elixhauser (mean (sd))
## mingcs (mean (sd))
## age (mean (sd))
## heartrate_min (mean (sd))
## heartrate_max (mean (sd))
## heartrate_mean (mean (sd))
## sysbp_min (mean (sd))
## sysbp_max (mean (sd))
## sysbp_mean (mean (sd))
## diasbp_min (mean (sd))
## diasbp_max (mean (sd))
## diasbp_mean (mean (sd))
## meanbp_min (mean (sd))
## meanbp_max (mean (sd))
## meanbp_mean (mean (sd))
## resprate_min (mean (sd))
## resprate_max (mean (sd))
## resprate_mean (mean (sd))
## tempc_min (mean (sd))
## tempc_max (mean (sd))
## tempc_mean (mean (sd))
## spo2_min (mean (sd))
## spo2_max (mean (sd))
## spo2_mean (mean (sd))
## aniongap_min (mean (sd))
## aniongap_max (mean (sd))
## albumin_min (mean (sd))
## albumin_max (mean (sd))
## bands_min (mean (sd))
## bands_max (mean (sd))
## bicarbonate_min (mean (sd))
## bicarbonate_max (mean (sd))
## bilirubin_min (mean (sd))
## bilirubin_max (mean (sd))
## creatinine_min (mean (sd))
## creatinine_max (mean (sd))
## chloride_min (mean (sd))
## chloride_max (mean (sd))
## glucose_min (mean (sd))
## glucose_max (mean (sd))
## hematocrit_min (mean (sd))
## hematocrit_max (mean (sd))
## hemoglobin_min (mean (sd))
## hemoglobin_max (mean (sd))
## lactate_min (mean (sd))
## lactate_max (mean (sd))
## platelet_min (mean (sd))
## platelet_max (mean (sd))
## potassium_min (mean (sd))
## potassium_max (mean (sd))
## ptt_min (mean (sd))
## ptt_max (mean (sd))
## inr_min (mean (sd))
## inr_max (mean (sd))
## pt_min (mean (sd))
## pt_max (mean (sd))
## sodium_min (mean (sd))
## sodium_max (mean (sd))
## bun_min (mean (sd))
## bun_max (mean (sd))
## wbc_min (mean (sd))
## wbc_max (mean (sd))
## po2 (mean (sd))
## pco2 (mean (sd))
## ph (mean (sd))
## urineoutput (mean (sd))
With each association calculated, we can see which variables were signicantly associated at an alpha cutoff of <0.05. 59 of the 66 continuous variables are associated with hospital death. As we can see, vent and rrt are associated, and some, but not all of the race categories are associated. Male gender was not associated.
We can start by simply asking, are there any significant correlations? We can automate this search with a loop (adapted from StackOverflow). We’ll use a p-value cutoff of 0.5 for signifance, and only look for “strong” relationships where |r| > 0.7. We’ll also view the output of our scan as a plot, asking ggplot2 to fit a regression line for all of the associations that meet criteria.
Note: we’ll use some simple string detection to try and avoid things that are obviously correlated like heartrate_max and heartrate_mean.
cont_var <- colnames(cohort %>% select(-hadm_id, -hospital_expire_flag, -male_gender, -white_eth, -black_eth,
-hispanic_eth, -asian_eth, -other_eth,
-vent, -rrt))
correlations <- rcorr(as.matrix(cohort[, cont_var]))
m <- length(cont_var)
for (i in 1:m) {
for (j in 1:m) {
if (!is.na(correlations$P[i,j])) {
if (correlations$P[i,j] < 0.05 & abs(correlations$r[i,j]) > 0.7) {
xparam <- str_split(rownames(correlations$P)[i], "_")[[1]][1]
yparam <- str_split(rownames(correlations$P)[j], "_")[[1]][1]
if (!identical(xparam, yparam)) {
p <- ggplot(cohort) +
geom_point(aes(x = get(rownames(correlations$P)[i]), y = get(colnames(correlations$P)[j]))) +
xlab(rownames(correlations$P)[i]) + ylab(colnames(correlations$P)[j]) +
geom_smooth(aes(x = get(rownames(correlations$P)[i]), y = get(colnames(correlations$P)[j])), method = c("lm"))
print(p)
}
}
}
}
}
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 33 rows containing missing values (geom_point).
## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).
## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).
## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).
## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).
## Warning: Removed 1177 rows containing non-finite values (stat_smooth).
## Warning: Removed 1177 rows containing missing values (geom_point).
## Warning: Removed 1177 rows containing non-finite values (stat_smooth).
## Warning: Removed 1177 rows containing missing values (geom_point).
## Warning: Removed 1177 rows containing non-finite values (stat_smooth).
## Warning: Removed 1177 rows containing missing values (geom_point).
## Warning: Removed 1177 rows containing non-finite values (stat_smooth).
## Warning: Removed 1177 rows containing missing values (geom_point).
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 33 rows containing missing values (geom_point).
The only strong significant associations are between variables which are, by definition or nature, linked: 1) Systolic and diastolic pressure define MAP 2) Sodium and chloride are intrinsically linked as a cation-anion pair and stay in balance. 3) Hemoglobin is a measure of the concentration of hemoglobin in the blood, and hematocrit is the fraction of the blood volume which is red blood cells. As such, they are intrinsically linked by definition. 4) PT and PTT test the two main pathways in the coagulation cascade (extrinsic, and intrinsic), and INR is a normalized version of PT done at each laboratory in order to control for the susceptibility of PT to chemical factors in each laboratory. Thus INR and PT are intrinsically linked by definition.
With respect to point 4, we do note two seperate lines on the graphs for PT and INR. Because of the way INR is derived, this is likely capturing a change that occurred in the laboratory standard, and doesn’t warrant further exploration.
Now lets perform a principle component analysis.
cohort_mat <- as.matrix(cohort[,3:76])
cohort_mat[is.na(cohort_mat)] <- 0 # An unfortunate requirement of PCA
cohort_pca <- prcomp(cohort_mat, center = TRUE, scale = TRUE)
autoplot(cohort_pca, x = 1, y = 2)
autoplot(cohort_pca, x = 3, y = 4)
Looking at the first 4 PCs we see that while there appears to be some group separation, the groups that separate off are quite small. As this PCA was performed to explore the data and not to specifically improve model buidling, and since the discriminatory effect appears minimal, this approach won’t be developed further.
In our EDA we’ve learned the following: 1) Demographic factors seem to matter for hospital mortality. 2) Most of the variables are univariately associated with the outcome. 3) There does not appear to be a significant amount of correlation between variables past the obvious associations discussed. 4) The first few PC’s obtained from PCA do not very clearly separate the populations, although future work could attempt to identify what the little off-shoots are.
With respect to point 2), some non-significant variables are worth excluding, but not all of them. Chloride_min and max can go; they’re not associated and chloride_mean will remain. However, mean_bp_max, glucose_min, sodium_min, and sodium_max all seem clinically important, and will be kept at least until their lack of importance has been demonstrated in the GBM importance matrix. In addition, Elixhauser, while not predicting death, is expected to be important in stratifying patients because it is the comorbidity burden, and should be kept for examination in the model. Gender did not seem important, but we can also let this be examined in the model, and since some of the race dummy vars were significant, we’ll keep them all.
cohort <- cohort %>% select(-chloride_min, -chloride_max)
With respect to point 3) we should remove variables that are “redundant” by nature of deinition. As such, we’ll keep MAP dropping diastolic and systolic pressures, INR dropping PT, and hemoglobin dropping hematocrit.
cohort <- cohort %>% select(-diasbp_min, -diasbp_max, -diasbp_mean, -sysbp_min,
-sysbp_max, -sysbp_mean, -pt_min, -pt_max,
-hematocrit_min, -hematocrit_max)
Finally, before we move onto building our model, lets revist the amount of NA present in our data and see if it is worth not using a varaible that has substantial missing data, and could thus do more harm to model fitting than good.
sapply(cohort, function(x) sum(is.na(x)))
## hadm_id hospital_expire_flag male_gender
## 0 0 0
## white_eth black_eth hispanic_eth
## 0 0 0
## asian_eth other_eth vent
## 0 0 0
## rrt elixhauser mingcs
## 0 0 15
## age heartrate_min heartrate_max
## 0 0 1
## heartrate_mean meanbp_min meanbp_max
## 0 0 0
## meanbp_mean resprate_min resprate_max
## 0 16 11
## resprate_mean tempc_min tempc_max
## 11 167 167
## tempc_mean spo2_min spo2_max
## 167 13 14
## spo2_mean aniongap_min aniongap_max
## 13 96 96
## albumin_min albumin_max bands_min
## 6889 6889 10780
## bands_max bicarbonate_min bicarbonate_max
## 10779 54 53
## bilirubin_min bilirubin_max creatinine_min
## 5328 5328 37
## creatinine_max glucose_min glucose_max
## 37 31 31
## hemoglobin_min hemoglobin_max lactate_min
## 60 60 3335
## lactate_max platelet_min platelet_max
## 3335 67 67
## potassium_min potassium_max ptt_min
## 27 27 1224
## ptt_max inr_min inr_max
## 1224 1177 1177
## sodium_min sodium_max bun_min
## 29 28 38
## bun_max wbc_min wbc_max
## 38 71 71
## po2 pco2 ph
## 4003 4004 4005
## urineoutput
## 0
While some of the blood gas associated values are missing for a litle under a third of the cohort, the major issues are bands, albumin, and bilirubin. Clinically, bands are an excellent indicator of an infectious process, but they will probably not help us much in our modeling, especially in a cohort with infection required as an inclusion criteria. Furthermore, bilirubin and albumin have clinical utility, but are not routinely collected on initial admission unless the history of present illness suggest their necessity, and thus they can also be removed.
cohort <- cohort %>% select(-bands_min, -bands_max, -bilirubin_min,
-bilirubin_max, -albumin_min, -albumin_max)
We’ll keep the ABG associated tests, despite them missing for so many, because ABG’s tell us so much about acid-base physiology which is important in all medicine, and especially critical care. That said, if they are not found to be important to the GBM, we will remove them.
With that, we are ready to move onto model building.
First, lets make a training set.
set.seed(10)
indices <- createDataPartition(y = cohort$hospital_expire_flag, p = 0.9)$Resample
train <- cohort %>% dplyr::slice(indices)
test <- cohort %>% dplyr::slice(-indices)
These data need to be converted to matrices in order to be used with xgboost.
train_label <- train$hospital_expire_flag
test_label <- test$hospital_expire_flag
train_mat <- as.matrix(train[,3:58])
test_mat <- as.matrix(test[,3:58])
train_xgmat <- xgb.DMatrix(data = train_mat, label = train_label)
test_xgmat <- xgb.DMatrix(data = test_mat, label = test_label)
We begin by setting starting paramters. These starting points were chosen based on prior work and reading.
params <- list(booster = "gbtree", objective = "binary:logistic",
eta = 0.3, gamma = 0, max_depth = 10, min_child_weight = 3,
subsample = 0.5, colsample_bytree = 0.5)
Using these baseline parameters we’ll build our first mode.
mort_gbm1 <- xgb.train(params = params, data = train_xgmat, nround = 100,
watchlist = list(train = train_xgmat),
print_every_n = 10, early_stopping_rounds = 50,
maximize = T, eval_metric = "auc")
## [1] train-auc:0.754550
## Will train until train_auc hasn't improved in 50 rounds.
##
## [11] train-auc:0.900717
## [21] train-auc:0.942498
## [31] train-auc:0.969411
## [41] train-auc:0.983981
## [51] train-auc:0.991971
## [61] train-auc:0.995228
## [71] train-auc:0.998212
## [81] train-auc:0.999105
## [91] train-auc:0.999659
## [100] train-auc:0.999916
We now evaluate this first model.
pred <- predict(mort_gbm1, test_xgmat)
roc_gbm1 <- roc(as.numeric(pred > 0.5), test_label)
roc_gbm1$auc
## Area under the curve: 0.6912
ggroc(roc_gbm1) + geom_abline(slope = 1, intercept = 1)
confusionMatrix(as.numeric(pred > 0.5), test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1031 200
## 1 81 97
##
## Accuracy : 0.8006
## 95% CI : (0.7787, 0.8211)
## No Information Rate : 0.7892
## P-Value [Acc > NIR] : 0.1556
##
## Kappa : 0.2974
## Mcnemar's Test P-Value : 1.932e-12
##
## Sensitivity : 0.9272
## Specificity : 0.3266
## Pos Pred Value : 0.8375
## Neg Pred Value : 0.5449
## Prevalence : 0.7892
## Detection Rate : 0.7317
## Detection Prevalence : 0.8737
## Balanced Accuracy : 0.6269
##
## 'Positive' Class : 0
##
This first model is rather specific, but poorly sensitive. This follows logically given the prevalence of death in the cohort. However, we saw that the training AUC was quite good; we can introduce regularization to help with this issue.
While a grid search of the parameter space is the oft recommended approach to hyperparameter tuning, it is rather inefficent. In addition to the computational burden, it has been suggested in the literature that it doesn’t make sense given that most hyperparameters “do not matter much.” Bergstra and Benigo describe this in the Journal of Machine Learning.
Instead, we’ll draw 1,000 random samples from the parameter space. We’ll optimize gamma, the regularization parameter, as well as the learning rate (eta), subsample, sample by tree, our tree depth, and our the minimum child weight.
note: I’ve set this to eval false for knitting since I recorded the values I obtained when initial coding this I’ve done this because runing it takes more than an hour
parallelStartSocket(cpus = detectCores())
parallelLibrary("mlr")
traintask <- makeClassifTask (data = train, target = "hospital_expire_flag")
testtask <- makeClassifTask (data = test, target = "hospital_expire_flag")
xgb_lrn <- makeLearner("classif.xgboost", predict.type = "response")
xgb_lrn$par.vals <- list(booster = "gbtree", objective = "binary:logistic",
nrounds = 50L, eval_metric = "auc")
params <- makeParamSet(makeIntegerParam("max_depth", lower = 10L, upper = 15L),
makeIntegerParam("gamma", lower = 10L, upper = 20L),
makeIntegerParam("min_child_weight", lower = 15L, upper = 25L),
makeNumericParam("eta", lower = 0.2, upper = 0.3),
makeNumericParam("subsample", lower = 0.3, upper = 0.5),
makeNumericParam("colsample_bytree", lower = 0.3, upper = 0.5))
resamp <- makeResampleDesc("CV", stratify = T, iters = 5L)
tune_control <- makeTuneControlRandom(maxit = 1000)
tune <- tuneParams(learner = xgb_lrn, task = traintask, resampling = resamp,
measures = acc, par.set = params, control = tune_control,
show.info = T)
parallelStop()
tuned_lrn <- setHyperPars(xgb_lrn, par.vals = tune$x) # assign the optimal parameters
mort_gbm2 <- mlr::train(learner = tuned_lrn, task = traintask)
Because the above search is random, and it was tweaked and run multiple times initially, re-compilation of this file may result in changes to the output. Becuase I wanted to write up this work, I’ve manually saved the output I recieved.
The initial output I recieved and carried forward in my analysis is as follows:
max_depth = 10 gamma = 14 min_child_weight = 25 eta = 0.267 subsample = 0.382 colsample_bytree = 0.443
params <- list(booster = "gbtree", objective = "binary:logistic",
eta = 0.267, gamma = 14, max_depth = 10, min_child_weight = 25,
subsample = 0.382, colsample_bytree = 0.443)
We have yet to optimize our iteration count, and so finally, we’ll use 5-fold cross validation to set that.
nround_cv <- xgb.cv(params = params, data = train_xgmat, nrounds = 200,
nfold = 5, showsd = T, stratified = T, print_every_n = 10,
early_stopping_rounds = 50, maximize = F, metrics = "error")
## [1] train-error:0.205167+0.004108 test-error:0.209284+0.010468
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 50 rounds.
##
## [11] train-error:0.190328+0.001622 test-error:0.195017+0.008555
## [21] train-error:0.184790+0.001665 test-error:0.190840+0.007913
## [31] train-error:0.183687+0.001622 test-error:0.190367+0.007907
## [41] train-error:0.182859+0.002162 test-error:0.190288+0.007712
## [51] train-error:0.183174+0.001961 test-error:0.189421+0.008163
## [61] train-error:0.182859+0.001787 test-error:0.190130+0.007444
## [71] train-error:0.182248+0.002187 test-error:0.189500+0.006770
## [81] train-error:0.182090+0.002154 test-error:0.189105+0.006296
## [91] train-error:0.181775+0.002330 test-error:0.190052+0.006154
## [101] train-error:0.181972+0.002575 test-error:0.190130+0.006109
## [111] train-error:0.181184+0.002523 test-error:0.189894+0.005099
## Stopping. Best iteration:
## [69] train-error:0.182347+0.002121 test-error:0.188712+0.007325
mort_gbm2 <- xgb.train(params = params, data = train_xgmat, nround = nround_cv$best_iteration,
watchlist = list(train = train_xgmat),
print_every_n = 10, early_stopping_rounds = 50,
maximize = T, eval_metric = "auc")
## [1] train-auc:0.656760
## Will train until train_auc hasn't improved in 50 rounds.
##
## [11] train-auc:0.756205
## [21] train-auc:0.763559
## [31] train-auc:0.769669
## [41] train-auc:0.772306
## [51] train-auc:0.772530
## [61] train-auc:0.774704
## [69] train-auc:0.776851
pred <- predict(mort_gbm2, test_xgmat)
roc <- roc(as.numeric(pred > 0.5), test_label)
ggroc(roc) + geom_abline(slope = 1, intercept = 1)
roc$auc
## Area under the curve: 0.7463
Our final model achieves an AUROC of 0.7463, just under the AUROC of SOFA. It is worth noting that our initial model had an AUC of 0.6912, and so tuning was very useful.
imp_mat <- xgb.importance(feature_names = colnames(train_xgmat), model = mort_gbm2)
xgb.ggplot.importance(imp_mat)
The features ultimately selected are shown above. The selected features make a great deal of sense: burden of comorbidity and age lead the pack. Our EDA demonstrated the importance of these features. It is also noteworthy that ethnicity emerged, as our EDA might have predicted.
The most noteworthy conclusion of this initial work is that only physiological data and clinical assessments (Elixhauser, GCS) were kept by the algorithim. Laboratory data were not useful, falling below even ethnicity.